This project will examine the Prosper Loans data by exploring the dataset, visualising it, and looking for trends, patterns, and useful or interesting information. We carry out univariate, bivariate and multivariate exploration.
The dataset is found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv
The dataset variables are explained here: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0
#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
df = pd.read_csv('prosperloandata.csv')
Copying the data, in case I need to refer back to the original if something goes wrong during the cleaning process
df_cle=df.copy()
Visually displaying the data to view how the table and columns are set up
df.head()
Using the describe method to show basic statistics for the numerical columns below:
df.describe()
#Looking at the datatypes
df.dtypes
Looking at the number of Rows and columns using the shape method below:
df.shape
Looking for duplicated listing keys below (these should be unique values, with ListingKey being the same value as the 'key' used in the listing object in the API) - there are 889.
df.ListingKey.duplicated().sum()
I will delete the duplicate values
df = df.drop_duplicates(subset="ListingKey")
df
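As a quick sanity check, `drop_duplicates` on a key column should leave every remaining key unique. A minimal sketch on toy data (the values here are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy frame with one duplicated listing key (hypothetical values)
toy = pd.DataFrame({'ListingKey': ['A1', 'B2', 'B2', 'C3'],
                    'Amount': [100, 200, 200, 300]})
deduped = toy.drop_duplicates(subset='ListingKey')

# Every remaining key is now unique
assert deduped['ListingKey'].duplicated().sum() == 0
```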
Looking for Null values below
#Counting the total number of null values across the whole dataframe
df.isna().sum().sum()
There are 1,354,429 null values in total, so I need to explore which columns contain the most.
df.isnull().sum().nlargest(20)
As we can see, some columns contain a lot of null values, so I will get rid of any column with more than 10% null values.
#https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
column_with_nan = df.columns[df.isnull().any()]
df_shape = df.shape
for column in column_with_nan:
    print(column, df[column].isnull().sum())
for column in column_with_nan:
    if df[column].isnull().sum() * 100.0 / df_shape[0] > 10:
        df.drop(columns=column, inplace=True)
df.head(3)
We can see that there are now 63 columns, so the 18 columns with more than 10% null values have been successfully removed
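The same column filter can be written without an explicit loop, using the mean of the null mask as each column's null fraction. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_full': range(10),
    'mostly_null': [1] + [np.nan] * 9,  # 90% null, should be dropped
})
# isnull().mean() gives each column's fraction of null values
kept = toy.loc[:, toy.isnull().mean() <= 0.10]
```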
#Looking at the datatypes
df.dtypes
df['LoanStatus'].unique()
#setting the size of the seaborn graphs
sns.set(rc={'figure.figsize':(18.7,13.27)})
Examining Loan Status: what does the distribution of loans look like?
sns.countplot(x="LoanStatus", data=df).set(title='LoanStatus');
df["LoanStatus"].value_counts()
We can see it's difficult to assess the 'Past Due' statuses on this graph as their counts are so small in comparison, so I will make a separate chart for them by filtering out statuses with fewer than 5,000 loans.
counts = df["LoanStatus"].value_counts()
counts_df_above_5k = df.loc[df["LoanStatus"].isin(counts.index[counts > 5000])]
sns.countplot(x="LoanStatus", data=counts_df_above_5k).set(title='LoanStatus > 5000');
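The value_counts/isin filter above keeps only the rows whose status occurs more often than the threshold. The same pattern in miniature:

```python
import pandas as pd

s = pd.Series(['Current'] * 6 + ['PastDue'] * 2)
counts = s.value_counts()
# Keep only rows whose category count exceeds the threshold (5 here)
frequent = s[s.isin(counts.index[counts > 5])]
assert set(frequent) == {'Current'}
```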
We can see that around 38k loans in the dataset have come to their natural end and are completed. Around 56k are 'Current' (i.e. still being paid back), 5k 'Defaulted' (failed to pay back) and 12k are 'Chargedoff' (a charge-off is a declaration by a creditor that an amount of debt is unlikely to be collected).
Therefore around 15.3% of the total loans have either defaulted or repaid less than the total value of the loan, and this share is expected to increase, as the majority of loans are still 'Current' and so still have the potential to not be paid in full.
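The ~15.3% figure can be reproduced from the status counts; this sketch uses approximate, rounded counts read off the value_counts output, not the exact figures:

```python
import pandas as pd

# Approximate counts from the notebook's value_counts output (rounded)
counts = pd.Series({'Current': 56000, 'Completed': 38000,
                    'Chargedoff': 12000, 'Defaulted': 5000})
bad_share = counts[['Chargedoff', 'Defaulted']].sum() / counts.sum()
print(round(bad_share * 100, 1))  # roughly 15.3
```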
We will now explore the statuses with fewer than 5,000 loans:
counts = df["LoanStatus"].value_counts()
counts_df_under_5k = df.loc[df["LoanStatus"].isin(counts.index[counts < 5000])]
sns.countplot(x="LoanStatus", data=counts_df_under_5k).set(title='LoanStatus < 5000');
#adding up all the values
800+361+311+304+265+ 203+16+ 5
counts_df_under_5k['LoanStatus'].value_counts()
We can see that for those past due, the majority (800) are only 1-15 days late, with a large drop-off to the later 'Past Due' buckets. This suggests that most people who fall behind soon get out of it (either by paying, defaulting, or being charged off).
Examining Borrower APR
As the column holds decimal fractions, I will multiply all values by 100 to get percentage values (i.e. 20 instead of 0.2) and plot them as a histogram. To do that I will use a function.
#defining the function
def percentage(original):
    percent = (original * 100).plot.hist()
    plt.xlabel('Interest Rate')
    plt.ylabel('Frequency')
    plt.grid()
    return percent
percentage(df["BorrowerAPR"])
df["BorrowerAPR"].mean()
df["BorrowerAPR"].median()
We can see that Borrower APR, which is defined as 'The Borrower's Annual Percentage Rate (APR) for the loan', is roughly normally distributed, following a bell-curve shape with a peak around 20%, which must be standard for the most common Prosper customer credit score. This is confirmed by both the mean and median being very close to this value (21.9% and 21.0%).
Now I will now explore Borrower Rate
percentage(df["BorrowerRate"])
df["BorrowerRate"].mean()
df["BorrowerRate"].median()
We can see that the borrower's actual interest rate for the loan was distributed very similarly, but was around 5 percentage points lower on average (as shown by the lower median and mean). This is likely because APR includes fees and other charges on top of the interest rate itself.
Exploring the Stated Monthly Income Variable
plt.figure(figsize=(15, 8))
sns.violinplot(data=df, x="StatedMonthlyIncome").set(title='Stated Monthly Income Violin Plot');
We can see from the violin plot that Stated Monthly Income follows a log-normal distribution: there are some extreme outliers, but most values are clustered together at the lower end of the scale. We should therefore be careful to remove the outliers when analysing this variable later on.
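Because the distribution looks log-normal, a log transform makes it roughly symmetric, and trimming the top 0.5% removes the extreme outliers. A sketch on synthetic log-normal data (the parameters are made up to mimic the shape, not fitted to the real column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic incomes mimicking StatedMonthlyIncome's log-normal shape
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=5000))

# Trim the top 0.5% as outliers
trimmed = income[income < income.quantile(0.995)]

# After a log transform the distribution is roughly symmetric
logged = np.log(trimmed)
assert abs(logged.mean() - logged.median()) < 0.1
```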
Lender Yield Vs Borrower APR: is there a linear relationship between the interest charged to the borrower (Borrower APR) and the Lender Yield (i.e. the lender's profit off the loan)? I would expect there to be, as more interest means more income for the lender, but it could be offset by higher service fees.
df.plot(x ='LenderYield', y='BorrowerAPR', kind = 'scatter', figsize=(12,9))
plt.xlabel('LenderYield')
# Set the y axis label of the current axis.
plt.ylabel('BorrowerAPR')
#Set title
plt.title('LenderYield Vs BorrowerAPR');
We can see that the higher the Borrower APR, the higher the lender yield, suggesting that any rise in service fees is compensated for by the higher interest. However, we will explore whether a higher interest rate is associated with more defaults.
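The strength of this relationship can be quantified with Pearson's r via `Series.corr`. A toy sketch where APR is modelled as yield plus a constant fee (hypothetical numbers, just to show the mechanics):

```python
import pandas as pd

toy = pd.DataFrame({'LenderYield': [0.05, 0.10, 0.15, 0.20, 0.25]})
toy['BorrowerAPR'] = toy['LenderYield'] + 0.03  # constant-fee assumption
r = toy['LenderYield'].corr(toy['BorrowerAPR'])
print(round(r, 2))  # a perfectly linear offset gives r = 1.0
```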
I have now noticed that the EstimatedLoss column, which was originally removed due to its high number of null values, was actually just null for the completed loans. So I will reload the full dataset (which still contains that column) to analyse whether, for current loans, a higher APR means a higher expected loss.
Estimated Loss Vs Borrower APR: Is there a linear relationship between the Estimated loss and Higher APR of the loan?
df = pd.read_csv('prosperloandata.csv')
df.plot(x ='EstimatedLoss', y='BorrowerAPR', kind = 'scatter', figsize=(12,9))
plt.xlabel('EstimatedLoss')
# Set the y axis label of the current axis.
plt.ylabel('BorrowerAPR')
#Set title
plt.title('EstimatedLoss Vs BorrowerAPR')
The relationship between estimated losses and Borrower APR is less linear, but there does appear to be some trend. We would expect the higher-APR loans to be given to higher-risk customers, which would give a bigger overall loss (though this could just as easily have led to more profit, from the higher rate on the loans that do perform).
Estimated Return Vs Borrower APR
df.plot(x ='EstimatedReturn', y='BorrowerAPR', kind = 'scatter', figsize=(12,9))
plt.xlabel('EstimatedReturn')
# Set the y axis label of the current axis.
plt.ylabel('BorrowerAPR')
#Set title
plt.title('EstimatedReturn Vs BorrowerAPR');
There is no linear relationship between Estimated Return and Borrower APR. This should be expected: borrowers with a higher APR are riskier, so default losses wipe out the higher profit from the interest rate, while the safer loans that default less are priced with a lower interest rate, which caps their return. Overall, loans are priced according to their risk, which stops those with higher/lower interest rates having a correspondingly higher/lower return.
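A near-zero Pearson r is what "no linear relationship" looks like numerically; with two independent toy variables standing in for APR and return, the correlation comes out close to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Independent toy stand-ins for BorrowerAPR and EstimatedReturn
apr = pd.Series(rng.uniform(0.05, 0.35, size=2000))
ret = pd.Series(rng.normal(0.07, 0.02, size=2000))
assert abs(apr.corr(ret)) < 0.1
```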
Next I will visualise the Prosper Rating/Borrower APR Data. I would expect the Borrower APR to decrease as the rating goes up because they are less risky (and so given the cheaper rate).
plt.style.use("fivethirtyeight")
df_viz = df[["ProsperRating (numeric)", "BorrowerAPR"]].dropna()
plt.figure(figsize=(15, 8))
sns.histplot(data=df_viz, x="ProsperRating (numeric)", y="BorrowerAPR").set(title='ProsperRating (numeric) Vs BorrowerAPR');
We can see that this is the case: the darker cells are more 'dense' (more borrowers fall there), and the overall downward trend in APR as the rating increases is clear on this graph.
Next I will look at how Borrower APR is distributed across the different employment statuses. I would expect Employed and Full-time to have the lowest APR as they would, in my opinion, be the less risky options.
df_viz = df[["BorrowerAPR", "EmploymentStatus"]].dropna()
sns.boxplot(data=df_viz, x="EmploymentStatus", y="BorrowerAPR").set(title='BorrowerAPR Vs Employment Status Boxplot');
We can see that Full-time and Employed do in fact have the lowest median APRs, and their boxes sit lower on the plot, showing that most of their APR values fall below those of the other groups.
Full-time has the biggest interquartile range, which we would expect: as the most common type of employment it will likely have the widest range of values.
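The box-plot reading can be confirmed numerically with a groupby median. A toy sketch (the APR values here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'EmploymentStatus': ['Employed', 'Employed', 'Part-time', 'Part-time', 'Retired'],
    'BorrowerAPR': [0.15, 0.17, 0.25, 0.27, 0.30],
})
# Median APR per employment status, lowest first
medians = toy.groupby('EmploymentStatus')['BorrowerAPR'].median().sort_values()
print(medians.index[0])  # the group with the lowest median APR
```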
Correlations: Which factors are correlated and what could the reason(s) be?
df.columns.tolist()
#Remove columns we don't need for correlations
corr_cols_to_remove = ["ListingKey", "ListingNumber", "ListingCreationDate", "LoanStatus",
                       "Occupation", "LoanKey", 'ListingCategory (numeric)']
df_corrs_pre = df.drop(corr_cols_to_remove, axis=1)
df_corrs = df_corrs_pre.corr(numeric_only=True)
df_corrs
#Showing in a heatmap
sns.heatmap(df_corrs, cmap="viridis", annot=True)
plt.title("Correlation Matrix");
There are too many correlations to view on the above heatmap, so I will take six that I am interested in and examine them instead.
interested_columns = ['CreditScoreRangeUpper', 'LoanOriginalAmount', 'CurrentCreditLines',
'ProsperRating (numeric)', 'BorrowerAPR', 'StatedMonthlyIncome']
interested_columns_corr = df[interested_columns].corr()
#Showing in a heatmap
sns.heatmap(interested_columns_corr, cmap="viridis", annot=True)
plt.title("Correlation Matrix");
We can see a significant negative relationship between the numeric Prosper Rating and the Borrower's APR: the higher the rating, the lower the APR. This makes sense, as the rating is essentially an internal credit rating of how good the borrower is, so borrowers that are more risky, with a lower Prosper Rating, are given higher interest rates/APRs to compensate for taking on that risk.
There is also a positive correlation of 0.55 between the customer's upper credit score range and the Prosper Rating, suggesting that credit score could play an important role in a customer's Prosper Rating.
We can see that 'LoanOriginalAmount' also has a 0.43 correlation with the Prosper Rating, so perhaps customers with a higher rating are able to borrow more.
Surprisingly, stated income was barely correlated, suggesting Prosper doesn't consider it relevant or expects customers to be dishonest in supplying that information.
Borrower APR has a negative correlation with the upper credit score, again likely because people with lower credit scores are higher risk and so are charged more to borrow.
Next I will explore a linear Regression on the factors that I think will affect the Prosper Rating.
selected_columns_prosper = df[["ProsperRating (numeric)","IsBorrowerHomeowner", 'EmploymentStatus', 'EmploymentStatusDuration',
'DebtToIncomeRatio', 'StatedMonthlyIncome', 'CreditScoreRangeUpper']]
pros_rate_df = selected_columns_prosper.copy()
# get_dummies returns columns in sorted category order, so join the dummies
# rather than assigning to a hand-written column list (which can silently mislabel them)
pros_rate_df = pros_rate_df.join(pd.get_dummies(pros_rate_df['IsBorrowerHomeowner'], prefix='Homeowner'))
pros_rate_df = pros_rate_df.join(pd.get_dummies(pros_rate_df['EmploymentStatus']))
pros_rate_df.head()
pros_rate_df = pros_rate_df.drop(columns=['EmploymentStatus', 'IsBorrowerHomeowner'])
pros_rate_df = pros_rate_df.dropna()
pros_rate_df.head()
X = pros_rate_df.drop(['ProsperRating (numeric)'], axis=1).astype(float)  # cast boolean dummies for OLS
Y = pros_rate_df['ProsperRating (numeric)']
X = sm.add_constant(X)
results1 = sm.OLS(Y, X).fit()
results1.summary()
Using linear regression analysis, we can see that all of these inputs other than 'Full-time' and 'Not available' employment status are statistically significant (with a p-value less than 0.05) and therefore likely to affect the loan rating. The 0.350 adjusted R-squared value suggests that these inputs account for approximately 35% of the variance in the Y variable, ProsperRating (numeric).
Next I will examine how different variables interact with each other, starting with the Prosper Rating and if they are a homeowner
plt.style.use("fivethirtyeight")
df_viz_ho = df[["ProsperRating (numeric)", "BorrowerAPR", "IsBorrowerHomeowner"]].dropna()
plt.figure(figsize=(15, 8))
sns.stripplot(data=df_viz_ho, x="ProsperRating (numeric)", y="BorrowerAPR", hue="IsBorrowerHomeowner", jitter=0.35, dodge=True).set(title='ProsperRating (numeric) Vs BorrowerAPR Vs IsBorrowerHomeowner');
We can see from the stripplot, which shows the density of homeowners vs non-homeowners, that for the lowest ratings (1 and 2) and higher Borrower APRs the dots are less dense, i.e. there are fewer homeowners. This makes sense, as homeowners are seen as less risky and so are less likely to be given a low rating (and to pay higher interest). At the top end of the scale, at 6 and especially 7, the higher density shows that more borrowers are homeowners, which makes sense as we would expect them to be the safest customers, with higher incomes and less debt. Throughout the middle of the scale, however, the pattern is not so clear, so other factors (e.g. employment status, credit score) matter more there. We can also see Borrower APR decreasing as the Prosper Rating rises and the customer is considered less risky.
Next I will explore how a customer's employment status interacts with the Prosper Rating and Borrower APR
df_viz = df[["ProsperRating (numeric)", "BorrowerAPR", "EmploymentStatus"]].dropna()
plt.style.use("fivethirtyeight")
plt.figure(figsize=(15, 8))
sns.barplot(data=df_viz, x="ProsperRating (numeric)", y="BorrowerAPR", hue="EmploymentStatus").set(title='ProsperRating (numeric) Vs Borrower APR Vs Employment Status');
We can see that 'Self-employed', 'Part-time' and 'Retired' consistently have the highest APRs within each Prosper Rating, which makes sense as these would be seen as riskier than people in stable full-time employment with a regular source of income.
Next I will explore how a customer's stated monthly income interacts with the loan amount and their employment status. I would expect higher income to lead to higher loan amounts, as these customers will generally be buying more expensive things (for example bigger houses, better cars).
plt.figure(figsize=(15, 8))
inc_df = df[df["StatedMonthlyIncome"] < 250000] # Exclude some outliers
sns.scatterplot(data=inc_df, x="LoanOriginalAmount", y="StatedMonthlyIncome", hue="EmploymentStatus").set(title='LoanOriginalAmount Vs StatedMonthlyIncome Vs Employment Status');
This is quite 'messy' on the chart and difficult to interpret as it's overplotted.
Also, several categories such as 'Full-time' and 'Employed' overlap. Therefore I will attempt to merge them into a single 'Employed' category for people working full time. I am assuming self-employed people work full time, which may not be the case, so the results should be taken with caution.
income= df[["LoanOriginalAmount", "StatedMonthlyIncome", "EmploymentStatus"]].dropna()
income = income[income["StatedMonthlyIncome"] < 250000] # Exclude some outliers
income["EmploymentStatus"].unique().tolist()
## Condition - either self-employed OR full-time
condition = (income["EmploymentStatus"] == "Self-employed") | (income["EmploymentStatus"] == "Full-time")
## Overwrite the existing column with 'Employed' if either previous condition
income["EmploymentStatus"] = np.where(condition, "Employed", income["EmploymentStatus"])
#check it's worked
income["EmploymentStatus"].value_counts()
plt.figure(figsize=(15, 8))
sns.scatterplot(data=income, x="LoanOriginalAmount", y="StatedMonthlyIncome", hue="EmploymentStatus").set(title='LoanOriginalAmount Vs StatedMonthlyIncome Vs Employment Status');
I have solved the overlapping employment categories, but the plot is again difficult to interpret as it is overplotted. Therefore I will bin the loan amounts into categorical intervals to reduce the number of plotted positions while still showing the overall pattern.
#Create a column splitting the LoanOriginalAmount into $5,000 intervals.
income['Loan Amount'] = income['LoanOriginalAmount'].apply(lambda x: (x//5000)*5000).astype(int)
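The integer-division bucketing above floors each amount to the bottom of its $5,000 band; a quick check of the rule on hand-picked values:

```python
import pandas as pd

amounts = pd.Series([1200, 4999, 5000, 12750, 24999])
# Floor each amount to the bottom of its $5,000 band
buckets = amounts.apply(lambda x: (x // 5000) * 5000).astype(int)
print(buckets.tolist())  # [0, 0, 5000, 10000, 20000]
```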
plt.figure(figsize=(15,8))
#omit top 0.5% of stated monthly income to remove outliers
income_plot = income[income["StatedMonthlyIncome"] < income["StatedMonthlyIncome"].quantile(0.995)]
sns.stripplot(data=income_plot,
              x="Loan Amount", y="StatedMonthlyIncome", hue="EmploymentStatus", dodge=True, jitter=0.35).set(title='LoanOriginalAmount Vs StatedMonthlyIncome Vs Employment Status');
Now we can see that the vast majority of people with loans are employed, and other than a few 'Not available' they generally have the higher stated incomes, which we would expect with job income.
Also, although there are employed people across the whole loan-amount range, the higher buckets, from $25,000 onwards, are almost entirely employed people. So perhaps people who are unemployed, retired or only part-time are blocked from borrowing the higher amounts, or employment counts significantly in the decision to grant a loan, which we would expect, as full-time employment is considered more reliable/safe.